
A Details of the Experiments

Neural Information Processing Systems

A.1 Details of the Datasets
Here we introduce the details of the datasets used in the experiments (see Table 6: Dataset Summary).
COMPAS: COMPAS [16] is a dataset containing the criminal records of 6,172 individuals arrested in Florida. The task is to predict whether an individual will reoffend within two years, and the probability predicted by the system is used as a risk score. We use 13 attributes for prediction.
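A minimal sketch of the setup described above: train a classifier on COMPAS records and use its predicted probability of two-year recidivism as the risk score. The data source and column names below are assumptions based on the publicly available ProPublica release, not taken from the paper, and only a small illustrative subset of attributes is used rather than the paper's 13.

```python
# Sketch: predicted probability of reoffending within two years as a risk score.
# Assumptions (not from the paper): ProPublica's COMPAS release and these columns.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

URL = ("https://raw.githubusercontent.com/propublica/"
       "compas-analysis/master/compas-scores-two-years.csv")

df = pd.read_csv(URL)
features = ["age", "priors_count", "juv_fel_count", "juv_misd_count"]  # illustrative subset
X = df[features]
y = df["two_year_recid"]  # 1 if the person reoffended within two years

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The predicted probability is what would be reported as the risk score.
risk_scores = clf.predict_proba(X_test)[:, 1]
print(risk_scores[:5])
```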


Long-form factuality in large language models

Neural Information Processing Systems

Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process that sends search queries to Google Search and determines whether each fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality.
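One reading of the proposed F1 extension, sketched below: precision is the fraction of extracted facts that are supported, while recall is measured against a parameter K giving the number of supported facts a user would consider sufficient. The exact formulation here is an assumption for illustration, not a quotation of the paper's definition.

```python
# Sketch of an F1-style aggregate for long-form factuality.
# Assumption: recall is computed relative to a chosen K (desired fact count).
def long_form_f1(num_supported: int, num_not_supported: int, k: int) -> float:
    """Combine fact-level precision with recall-at-K into a single score."""
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    recall = min(num_supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# Example: a response with 40 supported and 10 unsupported facts, K = 64.
print(round(long_form_f1(40, 10, k=64), 3))  # ~0.702
```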


Did faulty drug tests taint parole hearings? California is reviewing hundreds of denials

Los Angeles Times

The California Department of Corrections and Rehabilitation is reviewing hundreds of state parole hearings to see if any inmates who were denied parole were rejected because of faulty drug tests. Nearly 6,000 drug tests in California prisons are believed to have yielded false positives between April and July last year, and attorneys for the Board of Parole are now conducting a review of inmate files to determine if any of them need to appear before the parole board again to be reconsidered, according to officials with CDCR. If any inmates were denied parole because of the faulty tests, they could be owed a new hearing before the parole board, said attorneys representing inmates affected by the defective drug tests. The review is already underway and will determine if "without the positive drug screening, there is sufficient evidence to support an incarcerated person's denial of parole," said CDCR spokesperson Emily Humpal in a statement. If there isn't enough evidence to support incarceration other than the drug test, a new hearing will be scheduled.


Order-Independence Without Fine Tuning

Neural Information Processing Systems

The development of generative language models that can create long and coherent textual outputs via autoregression has led to a proliferation of uses and a corresponding sweep of analyses as researchers work to determine the limitations of this new paradigm. Unlike humans, these 'Large Language Models' (LLMs) are highly sensitive to small changes in their inputs, leading to unwanted inconsistency in their behavior. One problematic inconsistency when LLMs are used to answer multiple-choice questions or analyze multiple inputs is order dependency: the output of an LLM can (and often does) change significantly when sub-sequences are swapped, despite both orderings being semantically identical. In this paper we present Set-Based Prompting, a technique that guarantees the output of an LLM will not have order dependence on a specified set of sub-sequences. We show that this method provably eliminates order dependency and that it can be applied to any transformer-based LLM to enable text generation that is unaffected by re-orderings. Delving into the implications of our method, we show that, despite our inputs being out of distribution, the impact on expected accuracy is small, where the expectation is taken over uniformly chosen orderings of the candidate responses, and is usually significantly smaller in practice. Thus, Set-Based Prompting can be used as a 'drop-in' method on fully trained models. Finally, we discuss how our method's success suggests that other strong guarantees may be obtainable on LLM performance via modifying the input representations.
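To make the input-representation idea concrete, here is a minimal sketch of one way such order independence can be achieved: the parallel sub-sequences (e.g. answer options) share position indices and are masked so that no option attends to another, so swapping them leaves the model's view of the input unchanged. This is an illustrative construction under those assumptions, not the paper's reference implementation.

```python
# Illustrative construction: shared positions + block-causal attention over options.
import torch

def build_parallel_inputs(prefix_len: int, option_lens: list[int]):
    total = prefix_len + sum(option_lens)

    # Position ids: the prefix counts up normally; every option restarts
    # at the same offset, so the options are positionally interchangeable.
    position_ids = list(range(prefix_len))
    for n in option_lens:
        position_ids += list(range(prefix_len, prefix_len + n))
    position_ids = torch.tensor(position_ids)

    # Attention mask: every token may attend to the (earlier) prefix,
    # but tokens in one option never attend to tokens in another option.
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :prefix_len] = True
    mask[:prefix_len, :prefix_len] = torch.tril(          # causal within the prefix
        torch.ones(prefix_len, prefix_len, dtype=torch.bool))
    start = prefix_len
    for n in option_lens:
        blk = torch.tril(torch.ones(n, n, dtype=torch.bool))
        mask[start:start + n, start:start + n] = blk      # causal within each option
        start += n
    return position_ids, mask

pos, attn = build_parallel_inputs(prefix_len=5, option_lens=[3, 3, 4])
print(pos.tolist())
print(attn.int())
```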


96% of IT pros say AI agents are a security risk, but they're deploying them anyway

ZDNet

AI agents are being rapidly deployed within organizations even as they sow security fears, according to a new report from data governance firm SailPoint. Based on a global survey of more than 350 IT professionals, the report found that the widespread embrace of agents -- AI systems capable of formulating plans and taking action without human oversight -- is taking place within a security vacuum. Of IT pros who responded, 84% said their organizations already use agents internally, but just over half that number (44%) currently have policies in place to control the agents' behavior. Even more strikingly, 96% of respondents said they view agents as a security risk, yet 98% also said their employers plan to expand their use of agents in the coming year. Agents are the latest wave in a flood of innovation surrounding generative AI, which began in earnest following OpenAI's release of ChatGPT in late 2022.


T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models

Neural Information Processing Systems

The recent development of Sora has ushered in a new era in text-to-video (T2V) generation, and with it rising concern about safety risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover limited aspects and do not address the unique temporal risks inherent in video generation.


BendVLM: Test-Time Debiasing of Vision-Language Embeddings

Neural Information Processing Systems

Vision-language model (VLM) embeddings have been shown to encode biases present in their training data, such as societal biases that prescribe negative characteristics to members of various racial and gender identities. VLMs are being quickly adopted for a variety of tasks ranging from few-shot classification to text-guided image generation, making debiasing VLM embeddings crucial. Debiasing approaches that fine-tune the VLM often suffer from catastrophic forgetting. On the other hand, fine-tuning-free methods typically utilize a "one-size-fits-all" approach that assumes that correlation with the spurious attribute can be explained using a single linear direction across all possible inputs.
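For context, below is a minimal sketch of the "single linear direction" style of fine-tuning-free debiasing that the abstract contrasts with: estimate one direction for the spurious attribute from two groups of reference embeddings and project every embedding onto its orthogonal complement. This is the baseline idea only, not BendVLM itself, which adapts the debiasing per input; the function and variable names are illustrative.

```python
# Sketch of one-direction ("one-size-fits-all") embedding debiasing.
import numpy as np

def spurious_direction(attr_a_embs: np.ndarray, attr_b_embs: np.ndarray) -> np.ndarray:
    """Unit direction separating two attribute groups (e.g. prompts per gender)."""
    d = attr_a_embs.mean(axis=0) - attr_b_embs.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(embs: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove each embedding's component along the spurious direction, then renormalize."""
    coeffs = embs @ direction
    debiased = embs - np.outer(coeffs, direction)
    return debiased / np.linalg.norm(debiased, axis=1, keepdims=True)

rng = np.random.default_rng(0)
a, b = rng.normal(size=(8, 512)), rng.normal(size=(8, 512))
d = spurious_direction(a, b)
clean = project_out(rng.normal(size=(4, 512)), d)
print(np.allclose(clean @ d, 0.0))  # True: no remaining component along d
```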


A Multi-LexSum release

Neural Information Processing Systems

The authors are working on incorporating the script as part of the HuggingFace datasets library to further streamline the downloading and usage of Multi-LexSum. We include a similar instruction on the project website, https://multilexsum.github.io.
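A hypothetical loading call once the dataset is available through the HuggingFace datasets library; the dataset ID, config name, and split below are assumptions rather than details given in this note, so the project website should be consulted for the released identifiers.

```python
# Assumed dataset ID and config; verify against https://multilexsum.github.io.
from datasets import load_dataset

multi_lexsum = load_dataset("allenai/multi_lexsum", name="v20220616")
example = multi_lexsum["validation"][0]
print(example.keys())
```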


Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters

Neural Information Processing Systems

Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as "jailbreaks", which can bypass protective measures and induce harmful behavior. Recent advancements in LLMs have incorporated moderation guardrails that can filter outputs, which trigger processing errors for certain malicious questions. Existing red-teaming benchmarks often neglect to include questions that trigger moderation guardrails, making it difficult to evaluate jailbreak effectiveness. To address this issue, we introduce JAMBench, a harmful behavior benchmark designed to trigger and evaluate moderation guardrails. JAMBench involves 160 manually crafted instructions covering four major risk categories at multiple severity levels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against Moderation), designed to attack moderation guardrails using jailbreak prefixes to bypass input-level filters and a fine-tuned shadow model functionally equivalent to the guardrail model to generate cipher characters to bypass output-level filters. Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success (~19.88x) and lower filtered-out rates (~1/6x) than baselines.


CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses

Neural Information Processing Systems

The rapid progress in Large Language Models (LLMs) poses potential risks such as generating unethical content. Assessing the values embedded in LLMs' generated responses can help expose their misalignment, but this relies on reference-free value evaluators, e.g.